The sequencing world is rapidly changing due to declining costs, improved accuracy, and the advent of newer, cutting-edge instruments. Equally important are improvements in sequencing analysis, the process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several recent tools that are shaping the field.
Enhanced Variant Calling
Among the many groups responsible for today’s analysis tools, one laboratory stands out due to its impressive output and continual contributions. Fritz Sedlazeck and his team of skilled bioinformaticians at Baylor College of Medicine have developed numerous influential analysis tools, specializing in the detection and understanding of structural variations (SVs). Focusing primarily on large-scale genomics, the team applies novel technologies and develops innovative methods and algorithms to understand human genome complexity and disease associations.
Their most recent work, spearheaded by post-doc Luis Paulin, focuses on Sniffles, their popular structural variant caller. Published in Nature Biotechnology, the updated version (v2) detects SVs in germline, somatic, and population-level long-read data [1]. Sniffles2 is more accurate and faster than its predecessor. In particular, the newer version is designed for population-scale SV calling and enables the analysis of complex cases like tumor/normal comparisons and family studies. Sniffles2 also supports the detection of low-frequency SVs, which aids in the study of somatic SVs, mosaicism, and cellular heterogeneity.
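To make the idea of a low-frequency SV concrete, the sketch below labels calls by the fraction of reads that support the variant. It is a minimal illustration and not Sniffles2's actual method: the SVCall record, the classify function, and the 5% and 20% cutoffs are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class SVCall:
    """One structural variant call with read-level evidence (hypothetical record)."""
    sv_id: str
    support_reads: int    # reads supporting the variant allele
    reference_reads: int  # reads supporting the reference allele


def allele_fraction(call: SVCall) -> float:
    """Fraction of spanning reads that support the variant."""
    total = call.support_reads + call.reference_reads
    return call.support_reads / total if total else 0.0


def classify(call: SVCall, low: float = 0.05, high: float = 0.20) -> str:
    """Label a call by allele fraction; the cutoffs are illustrative assumptions."""
    af = allele_fraction(call)
    if af < low:
        return "below_threshold"   # too little support to report
    if af < high:
        return "low_frequency"     # candidate mosaic or somatic variant
    return "germline_like"         # near-heterozygous or homozygous support


calls = [
    SVCall("DEL_1", support_reads=3, reference_reads=47),
    SVCall("INS_2", support_reads=24, reference_reads=26),
    SVCall("DUP_3", support_reads=1, reference_reads=59),
]
labels = {c.sv_id: classify(c) for c in calls}
print(labels)
```

A variant present in only a small share of cells contributes only a small share of reads, so it lands between the two cutoffs instead of near the 50% or 100% expected for a germline variant.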
An upcoming project from another of the lab’s post-docs, Michal Izydorczyk, applies single-cell long-read whole-genome sequencing and software designed to detect novel transposon activity in individual brain cells. Izydorczyk focuses on multiple system atrophy (MSA), a disease similar to Parkinson's, by analyzing genetic variations such as SNVs, insertions, and deletions in single-cell DNA from MSA brains. His research targets variations that are specific to single cells and absent from bulk tissue, and that may influence the disease. Furthermore, he has developed a pipeline to map the most active transposons (SINE and LINE1) within these variations.
DRAGEN Secondary Analysis and Genomic AI Tools
The Sedlazeck group also collaborates with key organizations leveraging the latest analytical tools. Sairam Behera, an established member of the team, recently released a preprint on DRAGEN (Dynamic Read Analysis for GENomics), an Illumina-based analysis software and workflow that accelerates the detection of various genetic variations, including SNVs, indels, SVs, CNVs, and STRs [2]. The tool was shown to reduce the computational time for variant detection from raw data to approximately 30 minutes, and it compared favorably with other advanced methods in both accuracy and speed.
Rami Mehio, Head of Software and Informatics at Illumina, explained that DRAGEN secondary analysis continues to innovate, setting new standards for accuracy as demonstrated by the precisionFDA challenges, and provides comprehensive coverage of the genome. Mehio shared that DRAGEN 4.3, coming in June 2024, will include a next-generation multi-genome graph mapper that can incorporate hundreds of high-quality assemblies, such as those built by the Human Pangenome Reference Consortium (HPRC), and a new machine learning mosaic calling model in the small variant caller. It will also add a new family of specialized callers called MRJD (Multi-Region Joint Detection), which can be used to screen paralogous regions for mutations in difficult genes: PMS2 for hereditary cancer; SMN1, STRC, and NEB for carrier screening; and TTN and IKBKG for newborn screening. These join the advanced targeted callers that Illumina DRAGEN already provides, including the star allele caller and callers for difficult genes like CYP2D6, CYP2B6, and HLA, which enable significant pharmacogenomics (PGx) research capabilities, as well as callers for LPA, HBA, and more that enable cardiovascular disease research and carrier screening.
“Most of the advancement in our informatics suites stems from deep collaborations with customers and finding solutions that help them succeed,” shared Mehio. “The work on the multigenome mapper began three years ago when we realized a haploid reference is limiting, especially in highly polymorphic regions of the genome or in areas where mapping is difficult, and with this latest update, it enables us to provide our most accurate secondary analysis solution to date.”
In addition to DRAGEN, Deepthi Shankar, Sr Staff Product Marketing of Analytics, Software, and Informatics at Illumina, highlighted the company’s Connected Software suite, a comprehensive set of software products designed to integrate with Illumina sequencers. The suite supports the entire NGS workflow, encompassing lab and sample management, analysis, interpretation, and reporting, across applications such as genetic disease, oncology, and multiomics.
The implementation of genomic artificial intelligence (Genomic AI) is a major focus of innovation at Illumina. The company’s Emedgene platform, for example, is equipped with explainable AI (XAI) to enhance research in rare and hereditary diseases by providing evidence-backed insights for various assays [3]. “Importantly, it is also designed and developed with the ability to take advantage of all advanced variant calling features of the DRAGEN pipelines, such as STR, SV, SMA, and HBA calling, making the combination a great asset in the hands of rare disease researchers,” noted Mehio.
The Illumina Artificial Intelligence Lab, led by Kyle Farh, Vice President of Artificial Intelligence at Illumina, has also developed two new AI algorithms, PrimateAI-3D and SpliceAI, to address the challenge of understanding the medical significance of mutations beyond well-known coding regions. PrimateAI-3D is trained on sequencing data from 233 primate species and predicts pathogenic mutations in humans, while SpliceAI assesses the impact of variants on gene splicing. These tools have been crucial for analyzing large biobank datasets for genetic risk prediction and drug target discovery for common illnesses such as cardiovascular disease, type 2 diabetes, and autoimmune diseases. The Connected Software suite and other Illumina tools, Mehio noted, can be integrated into existing analysis workflows and are compatible with various analysis tools and databases.
Shankar highlighted the impact of these tools, noting that DRAGEN has repeatedly set benchmarks for speed and accuracy in genome analysis, earning multiple precisionFDA awards for its somatic and germline pipelines since achieving a Guinness World Record in 2017. Furthermore, the continued improvements in DRAGEN's technology have supported numerous significant research studies, including large-scale genomic projects like the UK Biobank and the All of Us program. Looking to the future of their analysis tools, Shankar explained that the company will continue to invest in informatics solutions that give customers a more accurate and comprehensive genome through connected, AI-powered solutions.
HiFi Sequencing Tools
The latest collaboration from the Sedlazeck group features Medhat Mahmoud, an accomplished post-doc, working with Pacific Biosciences (PacBio) and Twist Bioscience on a preprint for the Twist Alliance Dark Genes Panel (TADGP) [4]. The panel was designed to address the challenge of sequencing "dark regions" of the human genome, which are poorly represented in short-read sequencing data. Using PacBio’s HiFi long-read sequencing, TADGP efficiently resolved complex autosomal genes at a lower cost than whole-genome sequencing. This work also involved the development of a new analytical workflow that integrates mapping, assembly, and targeted callers and that demonstrated high accuracy across samples.
Mike Eberle, Vice President for Computational Biology at PacBio, explained that, alongside the company’s sequencing technologies, his group has developed numerous innovative sequencing analysis tools tailored to exploit the advantages of HiFi sequencing data. “They address a wide range of variant calling and analysis aspects such as tandem repeats, genes in segmental duplications, pharmacogenomics, HLA star alleles, haplotyping, structural variant calling, and methylation analysis,” he stated.
The motivation behind these tools was to address the limitations of existing sequencing technologies and software. Eberle shared that PacBio collaborates with clinical geneticists to identify key issues in the field that their tools now address. “Our tools were created to either improve on existing solutions, fill a gap in available tools, or ensure rapid response to customer concerns,” noted Eberle.
The primary difference between PacBio’s tools and existing solutions is the focus on long reads, which Eberle explained sets them apart in a field full of methods for short reads. “Our tools have been tailored to work seamlessly with data from our sequencing technology, allowing for more accurate and reliable variant calling.” Designed to work together for maximum efficiency and accuracy, these tools are also compatible with existing bioinformatics pipelines and can be easily incorporated into other genomic analysis tools and databases.
Among their many tools, Eberle highlighted the success of Paraphase, their segmental duplication caller [5]. This informatics method was developed to analyze the SMN1 and SMN2 genes, which are crucial in spinal muscular atrophy. Paraphase uses PacBio HiFi data to identify haplotypes, determine gene copy numbers, and call phased variants with high accuracy. “Using Paraphase, we identified pathogenic variants in segmental duplications that were not detectable by other technologies or other software tools,” stated Eberle. “In addition, the analysis of SMN1/SMN2 showed that we could identify distinct SMN1/2 haplotypes in the population and that some of these haplotypes correlated well with so-called silent carriers where individuals had one chromosome with no copies of SMN1 (and were thus carriers for spinal muscular atrophy).”
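The silent-carrier case Eberle describes can be sketched as a small piece of logic. The example below is a simplified illustration and not Paraphase's algorithm: the smn1_carrier_status function and its labels are hypothetical, and it assumes that the number of SMN1 copies on each of the two chromosomes has already been resolved by phasing.

```python
from typing import Tuple


def smn1_carrier_status(copies_per_chromosome: Tuple[int, int]) -> str:
    """Interpret SMN1 copy counts on each of the two chromosomes (simplified)."""
    first, second = copies_per_chromosome
    total = first + second
    if total == 0:
        return "no_smn1_copies"
    if first == 0 or second == 0:
        # One chromosome lacks SMN1. When the other carries two or more copies,
        # the total looks normal, so a total-copy-number test alone misses it.
        return "silent_carrier" if total >= 2 else "carrier"
    return "non_carrier"


for genotype in [(1, 1), (1, 0), (2, 0), (0, 0)]:
    print(genotype, smn1_carrier_status(genotype))
```

The (1, 1) and (2, 0) genotypes both total two copies, which is why per-chromosome haplotype information is needed to tell them apart.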
Looking ahead, Eberle explained that PacBio plans to develop tools for aggregating variant calls from multiple genomes and to improve annotation and comparison with population data, while also enhancing visualization capabilities. Researchers are encouraged to use these tools and to report any roadblocks they encounter, to help further refine and innovate the field of sequencing analysis.
References
1. Smolka M, Paulin LF, Grochowski CM, et al. Detection of mosaic and population-level structural variants with Sniffles2. Nature Biotechnology. Published online 2024. doi:https://doi.org/10.1038/s41587-023-02024-y
2. Behera S, Catreux S, Rossi M, et al. Comprehensive and accurate genome analysis at scale using DRAGEN accelerated algorithms. bioRxiv. Published online January 1, 2024:2024.01.02.573821. doi:https://doi.org/10.1101/2024.01.02.573821
3. Meng L, Attali R, Talmy T, et al. Evaluation of an automated genome interpretation model for rare disease routinely used in a clinical genetic laboratory. Genetics in Medicine. 2023;25(6). doi:https://doi.org/10.1016/j.gim.2023.100830
4. Mahmoud M, Harting J, Corbitt H, et al. Closing the gap: Solving complex medically relevant genes at scale. medRxiv. Published online January 1, 2024:2024.03.14.24304179. doi:https://doi.org/10.1101/2024.03.14.24304179
5. Chen X, Harting J, Farrow E, et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. The American Journal of Human Genetics. 2023;110(2):240-250. doi:https://doi.org/10.1016/j.ajhg.2023.01.001